Consensus for the Fip35 folding mechanism? 
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Recent advances in computational power and simulation programs finally delivered the first ex- 
amples of reversible folding for small proteins with an all-atom description. But having at hand the 
atomistic details of the process did not lead to a straightforward interpretation of the mechanism. 
For the case of the Fip35 WW-domain where multiple long trajectories of 100 /xs are available from 
D. E. Shaw Research, different interpretations emerged. Some of those are in clear contradiction 
with each other while others are in qualitative agreement. Here, we present a network-based analysis 
of the same data by looking at the local fluctuations of conventional order parameters for folding. 
We found that folding occurs through two major pathways, one almost four times more populated 
than the other. Each pathway involves the formation of an intermediate with one of the two hairpins 
in a native configuration. The quantitative agreement of our results with a state-of-the-art reaction 
coordinate optimization procedure as well as qualitative agreement with other Markov-state-models 
and different simulation schemes provides strong evidence for a multiple folding pathways scenario 
with the presence of intermediates. 



INTRODUCTION 

Computer models of protein folding have been around 
now for almost two decades [U [2] ■ Such models evolved 
dramatically, going from simple lattice models 3J and im- 
plicit solvent implementations "T to fully atomistic calcu- 
lations [S] . In the quest for microscopic models of protein 
folding, several groups focused on small proteins like WW 
domains which can be very informative yet much easier 
to treat. Recently, an important breakthrough in this di- 
rection was delivered by D. E. Shaw Research. By using 
in-house technology with optimized software they deliv- 
ered first examples of reversible folding for some mini- 
proteins [6] . For the case of the Fip35 WW domain they 
made available two trajectories each of which of 100 /is 
in length. These calculations raised a number of contro- 
versial interpretations on the actual folding mechanism. 

For this system, no agreement was found on whether 
the folding process proceeds via on-pathway intermedi- 
ates or downhill folding. The original work applied an 
optimization procedure to obtain a reaction coordinate 
for the folding process [7j, concluding that no relevant 
barriers are present in the folding process [HI . This inter- 
pretation was challenged by others. In particular, Krivov 
demonstrated the presence of intermediates by calculat- 
ing a novel optimized reaction coordinate [S]. This view 
was also supported, at least at a qualitative level, by 
Pande and Baker groups which independently analyzed 
the same data with Markov-state-models [51 [TU] . 

All approaches have found that the folding process is 
more likely to proceed via the formation of the first hair- 
pin (/3i) followed by the second one {(32) but the con- 
ceptual disagreement on the presence of intermediates 
or a downhill scenario is pretty strong. Unfortunately, 
these types of analysis [6, iSnlO] are not very intuitive, 
making an objective evaluation of the results hard. On 
one side, optimization procedures like the ones applied 



by Shaw and Krivov, tend to hinder the physical mean- 
ing of the obtained reaction coordinates while clustering 
of high-dimensional spaces strongly suffer from thermal 
fluctuations [11] and limited sampling [lO] . 

In an effort to bridge the gap between the use of more 
intuitive coordinates and the application of Markov- 
state-models, we recently extended an approach derived 
from single-molecule spectroscopy to study molecular 
simulations jl2|. Aiming at identifying the most robust 
features of Fip35 folding, we present here an extension of 
this framework for the analysis of D. E. Shaw data. The 
entire analysis is based on conventional order parameters 
time series, like root mean square deviations (RMSD). 
This overcomes the problem of working in complex mul- 
tidimensional spaces as in the case of coordinate opti- 
mizations [6l [8] , k- means clustering [9] or contact maps 
comparisons [lOj . Still, the approach provides a good 
assessment of the kinetics thanks to the application of 
complex networks analysis, allowing the development of 
simple Markov-state-models reproducing the time scales 
of the original MD trajectory [TTl[T2]. Our results rein- 
force the interpretation that Fip35 folding proceeds via 
intermediates. 



METHODS 

Markov state models from conventional order 
parameter analysis 

In this section we are going to explain the tools that 
were used to analyze the Fip35 data set. The main idea 
behind local fluctuations analysis is to build a Markov- 
state-model using as input the time series of a general or- 
der parameter. This is done by looking at the fluctuations 
of the coordinate within a predetermined time window. 
The approach was initially developed for single molecule 
experiments [13H15| , but in a recent paper we applied and 



extended this technique to study conventional order pa- 
rameter time series from molecular simulations |12) . The 
main motivation was to develop a tool to analyze those 
time series in a more rigorous way, going beyond straight- 
forward histogram analysis. In fact, the great advantage 
of this strategy is to characterize order parameter time 
series on the base of the kinetics, something that was 
definitively impossible by using free-energy projections 
and histogram analysis. The approach takes advantage 
of the work done in complex network analysis |161 117] 
and Markov-state-models |T5] in molecular systems but 
it overcomes some of the problems, e.g. avoiding to work 
in highly dimensional spaces. The downside here is that 
the framework is based on simple coordinates, therefore 
if they miss some relevant aspects of the system there is 
no way to recover that type of information. 

The steps to be covered are very similar to any other 
Markov-state-model: (i) microstate building, (ii) transi- 
tion network building, (iii) kinetic lumping. Although al- 
ready discussed elsewhere in detail [T2[Tni[T7], below we 
provide the essentials to better follow the paper. A code 
to reproduce the presented analysis is freely distributed 
at the website raolab . com. 



Microstate building 

Microstates were defined for every trajectory snapshot 
by looking at the local fluctuations of the order parameter 
coordinate within a time window t^ centered in the snap- 
shot itself OE]. Two time points belonged to the same 
microstate if they had comparable distributions of the 
order parameter within t^ according to a Kolmogorov- 
Smirnov test [H]. That is, if the condition D < C\/2/itu 
was fulfilled, where D is the maximum difference of the 
two cumulative distributions and ( corresponds to a cer- 
tain confidence level. Being t^, and C related, we fixed 
the latter value to 0.5 and let t^ vary as done in Ref. 
[T^ . Comparisons were made along the trajectory using 
the leader algorithm in a way that every time point was 
associated to a microstate [TH |20] ■ As shown by others 
[l4] and us [12], the methodology is robust for a reason- 
able wide range of time windows. For example, for values 
of the time window up to 13 ns the mean first passage 
time to the folded state steadily increases till a plateau. 
Then, between 13 and 24 ns the mean first passage time 
fluctuates around 4.6 /xs. Finally for larger windows, this 
value tends to slightly increase together with larger fluc- 
tuations but still in agreement with previous calculations 
[5]. In the following, we fix the time window to 18 ns. In 
the general case, short time windows are unable to cap- 
ture significantly well the coordinate distribution leading 
to faster kinetics [12] while long ones result in too many 
fast fluctuations to neighboring states inside a single dis- 
tribution. Given the /iS time scales of the folding process, 
this would happen only in cases when windows of hun- 



dreds of ns are selected. 



The configuration-space-network 

The resulting time series of microstates was mapped 
onto a configuration-space- network [T^ (TBI [IZl HU \TI\ . 
Microstates represent network nodes and a link between 
them exists if they were successively visited along the 
molecular trajectory. Detailed balance was imposed for 
each link by making an average of the number of tran- 
sitions in both directions. This was only partially nec- 
essary because the original trajectories mostly satisfied 
detailed balance already. 



Kinetic lumping by network clusterization 

Protein conformational states were defined by apply- 
ing a kinetic lumping scheme. This was done by running 
a clusterization procedure on the configuration-space- 
network, the Markov-Clustering- Algorithm (MCL) [23j . 
This approach assures that the obtained network clusters 
represent meaningful free-energy basins with preserved 
system kinetics [12l [17] . Being interested on the char- 
acterization of the folding mechanism, a granularity pa- 
rameter of 1.2 was used to focus on the highest barriers 
only [T^ [T7] . The obtained states were used to build a 
Markov-state-model, schematically representing all rele- 
vant slow transitions of the system. 



First passage time distributions 

The kinetic similarity between the original trajectory 
and the Markov-state-model was investigated by com- 
paring the distribution of the first-passage-times (fpt) 
[TTl [T^ to the folded state. This is the distribution of 
times to reach the native state from any other snapshot 
of the trajectory [TT]. For the Markov-state-model the 
fpt was calculated on trajectories generated by running 
a random walk. Arrival times depend on the definition 
of the target only and not on the detailed decomposition 
of the trajectory. For this study we used the definition 
of the native state as obtained by MCL. 



Molecular simulation data 

The simulation data was directly obtained from D. E. 
Shaw Research and published in Ref. [B] It is an all- 
atom molecular dynamics simulation in explicit water 
(TIP3P water model) at the protein's in silico melting 
temperature {395K). It was calculated using the Anton 
supercomputer with the modified Amber ff99SB-ILDN 
force field [24 carried out in the NVT ensemble using the 
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FIG. 1. Projected free-energy surfaces of Fip35: (a) as a 
function of the RMSD coordinate Rail (backbone atoms from 
residues 7 — 29); (b) as a function of the RMSD coordinates 
i?^i (residues 7— 23) and -R;32 (residues 18— 29). Free energies 
are expressed in units of kT. 



Nose-Hoover thermostat with a relaxation time of 1.0 ps 
[B, 24 . The simulation data consisted of two trajectories, 
each of length 100 ^s. 



RESULTS 

RMSD analysis of the folding mechanism: 
identification of putative intermediate states 

Conventional order parameters for protein folding in- 
clude RMSD with respect to the native state [3S], num- 
ber of native contacts |4^ or radius of gyration [26] . 
RMSD is certainly one of the most obvious choices when 
it comes to monitor folding to a known structure be- 
cause it requires minimal a priori knowledge (i.e. the 
native structure). The projected free-energy landscape 
onto the RMSD from the native state (Rail) shows two 
clear minima, corresponding to the folded and unfolded 
states (Fig.flk). To improve on this simple description we 
projected the landscape onto two coordinates instead of 
one. Given the triple stranded topology of Fip35, we ex- 
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FIG. 2. A time series stretch of Fip35. The RMSD for the 
whole structure Raii is shown in black. Rp^ and -R^j are 
shown as gray Hnes in the top and bottom panels, respec- 
tively. Gray band regions represent portions of the trajectory 
where one of the two hairpins is native while the other one is 
unfolded. 



pect that the RMSD coordinates from the first (residues 
7 — 23, i?/3i) and second (residues 18 — 29, i^^a) hairpin 
to provide additional information on the process. Fig. [lb 
shows a 2D projection onto these two new coordinates. 
The folded and unfolded states are clearly visible at re- 
gions around (1, 1) and (7,5) respectively. In agreement 
with the ID projection, this plot provides some further 
information on the presence of other states like the darker 
regions at around (2, 1) and (1, 3.5). Those regions might 
represent intermediate steps to the folding process, a 
property that cannot be validated by Fig. [i]d due to the 
lack of information on the kinetics [TBI [27| . 



The relevance of these regions for the folding pro- 
cess emerged by the inspection of some of the fold- 
ing/unfolding events. One example is shown in Fig. [2] 
In the two panels, the RMSD with respect to the na- 
tive state Rail, first hairpin {Rp-^, top panel) and second 
hairpin (i?^^ , bottom panel) are shown as black and gray 
lines, respectively. This picture shows that at around 
30.3 ^s an unfolding event is present. Here Rail rapidly 
increases (black line) as well as Rjs^ (gray, bottom panel) . 
This is not the case for 7?/3j (gray, top panel) which stays 
low for around 200 ns while the other hairpin is un- 
folded (the region is highlighted with a gray band). A 
similar behavior was observed for the folding event at 
31.1 fis, with the second hairpin being native for a time 
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FIG. 3. Markov-state-model. States were obtained by a lo- 
cal fluctuations analysis of Rp-^ and Rp^ followed by a kinetic 
lumping via network clusterization (see Methods for details). 
Total number of transitions and state populations are indi- 
cated. 
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span of roughly 150 ns before the complete folding event 
(right gray band). In the folding/unfolding process the 
two hairpins evolve in an uncorrelated manner. Con- 
sequently, the RMSD coordinates Rp^ and i?^^ provide 
independent information on the folding mechanism, sug- 
gesting the presence of on-pathway intermediates. It is 
important to note, while the total RMSD is able to re- 
port on the presence of these states (the RMSD in the 
gray band regions of Fig. [2] is lower than the completely 
unfolded state), this coordinate does not have the sen- 
sitivity to discriminate between partially folded states 
with the first hairpin formed from the ones with the sec- 
ond hairpin formed. For this reason, we do not think that 
Rail represents a good coordinate for a local fluctuations 
analysis. 



Local fluctuations analysis: kinetics assessment of 
the intermediate states 



Given these observations, we chose Rp^ and Rp^ coor- 
dinates as probes for a more insightful kinetic analysis of 
the folding mechanism. This was done by performing a 
joint local fluctuations analysis of these coordinates (see 
the Methods section for details). Being this approach 
developed for a single coordinate, here we extended the 
framework to account for multiple order parameters, such 
as Rp^ and Rp^. To do so, the local fluctuations of each 
coordinate were first analyzed separately using the stan- 
dard approach. Then, the obtained states for each coor- 
dinate were merged into a set of "combined" states. That 
is, given at a certain time the states corresponding to Rp-^ 
and Rp^ being respectively A and B, the new combined 
state is {A^B). This strategy includes the contribution 
of two coordinates in a simultaneous way. 

Application of this technique resulted in the identifica- 



FIG. 4. First-passage-time distributions to the native state. 
The black line corresponds to the first passage time distribu- 
tion along the original MD trajectory. The gray line together 
with the light gray region respectively show the average first 
passage time value and its standard deviation obtained by 10^ 
random walks of length lO'^ steps on the Markov-state-model 
of Fig. [3] An exponential fit of the data showed a relaxation 
time of 4.3 ^s which is in good agreement with the value es- 
timated from trajectory [8]. 



tion of six states with a population larger than (or equal 
to) 1.0%. The cumulative population of these states is of 
about 99.3%, indicating that they well characterize the 
sampled conformational space. In this representation the 
native (N) and unfolded states (U) have a population of 
59.0% and 32.0%, respectively. The remaining four states 
have a much smaller population of few percents. The six 
states were used to build a reduced Markov-state-model 
whose transition network is shown in Fig. l3] Interest- 
ingly, all states stay on-pathway from U to N. Specifically, 
two major folding intermediates were found just preced- 
ing the fully folded state: II and 12. These intermediates 
correspond to two independent folding routes with dif- 
ferent relative populations. We calculated this explicitly 
by looking along the original trajectory which states were 
preceding the folding state. Of the total 10 folding events 
7 and 2 events followed the II and 12 routes, respectively. 
A third pathway was followed only once. 

To check that the reduced Markov model of Fig. [3] was 
able to correctly reproduce the original dynamics of the 
MD trajectory, a first passage time analysis to the native 
state was computed. In Fig. |4] the distributions of the 
first passage times corresponding to the original trajec- 
tory and the six-states Markov model are shown as black 
and gray lines, respectively. Interestingly, the two dis- 
tributions present a very similar decay in the long times 
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FIG. 5. Projected free-energy surface of Fip35 for the native, 
unfolded and the two intermediate states as found by the local 
fluctuation analysis. Free energies are expressed in units of 
kT. 



regime, corresponding to a folding time of around 4.3 /iS. 
The ability of the Markov model to reasonably reproduce 
the folding time of the original trajectory is remarkable. 
Specifically, it indicates that the kinetic lumping via net- 
work clusterization correctly partitioned the whole free- 
energy landscape. When this would not be the case [TT], 
a much faster kinetics usually appears [12l [27] . 



Structural analysis of Fip35 folding intermediates 

In Fig. [5] a free-energy projection of the native, un- 
folded, II and 12 states onto the Rp^ and Rp^ coordi- 
nates is shown. The four states occupy well defined re- 
gions of the map. However, the distributions of inter- 
mediate states are rather broad producing large overlaps 
with both the native and unfolded states (compare also 
with Fig. IT] and check references [111 HZ]). Besides this, 
the two intermediate states have a good degree of native- 
ness: for the case of II (12) the value of Rp^ [Rpi) ^'"^ 
most of the time below 2 A. 

From a structural point of view, the two interme- 
diates are characterized by well-defined conformations. 
Structural superpositions of the native and intermedi- 
ate states are shown in Fig. [6] (while the unfolded state 
looks like a disordered bundle of structures). The three 
states present a reasonable amount of structural homo- 
geneity. For the case of II, the first hairpin is in its 
native conformation as well as the turn connecting the 
two hairpins. On the other hand the second hairpin is 
unstructured, interacting with the rest of the protein in 



several non-specific ways. In the 12 intermediate, an in- 
verted situation was found where the second hairpin is 
native. In this case however, the first hairpin is prone to 
fold into a specific configuration instead of being unstruc- 
tured. This conformation resembles the native structure 
but the formation of the first turn, and consequently the 
entire hairpin, is shifted by one residue (out-of-register 
conformations are typical in /3-hairpins |16l). 

It is striking to see the different amount of disorder in 

11 and 12 (Fig. ]6| , suggesting a qualitative reason for the 
different statistical relevance of the two folding pathways. 
In fact, folding through II involves the formation of new 
specific contacts in the P2 region while folding through 

12 first requires the disrupture of a number of non-native 
contacts in the /3i region due to the out-of-register con- 
formation. As such, 12 might even be considered per se 
a misfolded structure. 



DISCUSSION 

Complex network analysis of Fip35 RMSD local fluc- 
tuations provided evidence for three main observations: 
(i) beyond the native and unfolded states, hidden states 
were detected; (ii) among those states, two on-pathway 
intermediates for folding were found; (lit) the different 
amount of structural disorder in the two intermediates 
suggest a reason for the prelevance of one pathway with 
respect to the other. 

Previous calculations based on Markov-state-models 
found multiple pathways and a heterogeneous molecular 
mechanism. In contrast to us, structural clustering [5] 
or likelihood methods in conjunction with contact maps 
[TU] were used to build the Markov models. Although it 
is difficult to compare these approaches in a quantitative 
way, their predictions are in qualitative agreement with 
our results. Interestingly, the use of an alternative simu- 
lation protocol to probe slow conformational transitions 
confirmed the presence of two main folding pathways [3H] . 
Given the triple stranded native topology of a WW do- 
main, the presence of these two pathways is not new. The 
same folding routes were already observed in the past for 
a 20 residues triple stranded /3-sheet peptide in implicit 
solvent [JlfTB]. 

So far, one of the most robust interpretations of Fip35 
folding was provided by the analysis of Krivov [5] . Our 
findings are in excellent agreement with that study. This 
is quantitatively shown in FigJT] where we projected our 
states to the optimized coordinate developed in that 
work. This comparison reveals that the two approaches 
provide very similar results. In Krivov's profile the na- 
tive, II and unfolded states were identified as peaks of the 
probability distribution JSJ. Strikingly, the distributions 
arising from our detected states overlap very well with 
these peaks. For the native state, only a very small frac- 
tion of 0.7% was found in the wrong part of the profile 





FIG. 6. Structural superpositions of the native and the two intermediate states as found by the local fluctuation analysis. 
Each panel contains 25 randomly chosen frames. Structures were superimposed on residues 7—23 and 18 — 29 for II and 12, 
respectively. 



(green peak between the value 18 and 26 of the coordi- 
nate) while for II there is a perfect agreement. Moreover, 
the second intermediate 12, which originally could not be 
directly detected from the profile, was found in the same 
position as predicted in Ref. [S] (12 was hidden because 
parallel pathways cannot be simultaneously displayed in 
this representation). 

Overall, the two approaches provided the same mech- 
anistic understanding. This is encouraging given the 
strong diversity of the two methods. In fact, Krivov's 
reaction coordinate was obtained via an optimization 
procedure starting from an educated guess, i.e. a lin- 
ear combination of conventional (non-optimized) coordi- 
nates. The procedure makes a parameter space search 
minimizing the flux on top of the barriers of the ini- 
tial free-energy projection. Better results were obtained 
when using several (i.e. thousands) of coordinates, e.g. 
all pairwise inter-atomic distances in a protein [8l I29j . 
Unfortunately, this makes the optimization procedure 
highly non-trivial "^U]. The strength of this method 
is to provide kinetically meaningful free-energy profiles 
with diffusive dynamics. The downsides are the intrinsic 
limitations of ID profiles to describe parallel pathways 
and the non-trivial optimization procedure. The com- 
plex network analysis presented here does not run any 
optimization algorithm but attempts to detect hidden 
states from the time series of a generic coordinate, e.g. 
the RMSD. This makes our method much faster. The 
downsides are again the need of an educated guess for 
the selection of the coordinate to use as well as poten- 
tially larger errors in the resulting kinetic models. 

In conclusion, most of the results presented so far on 
Fip35 folding point to the same direction. We believe 
that the original interpretation provided by Shaw and 
co-workers of downhill folding is due to an inappropriate 
application of the reaction coordinate optimization they 



used. Being based on commitor probabilities, that ap- 
proach was designed and tested for two-state processes 
only [7 . Given the implicit assumption of a two-state 
scheme, application of this protocol to multi-state sys- 
tems leads to inaccurate results. 
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